High Availability Architecture
Itential Automation Platform (IAP) has been designed with a focus on high availability. This guide will provide detailed component information along with architectural sizing to enable the design and deployment of a highly available IAP.
Note: This is a living document that is continually revised and updated. As such, it may not align precisely with the IAP Installation Guide. We recommend that you check back for updates. If you have any questions, please contact Itential Product Support.
IAP HA Component Overview
Itential Automation Platform
The Itential Automation Platform has been designed so that multiple IAP servers can be used in a clustered fashion. These clustered IAP servers share the same MongoDB "pronghorn" database in order to share storage, which allows the Workflow Engine to run workflow tasks across any number of available IAP servers. In addition, they usually share the same Redis and RabbitMQ servers to provide appropriate communication. When paired with a front-end load balancer, this provides both horizontal scaling and high availability, with redundant servers available in the event of a server failure.
Active vs Standby Servers
The Itential Automation Platform supports both Active and Standby IAP servers. In most use cases, standby servers are those that do not actively accept user traffic or work tasks from automations during normal business operations. However, they are ready to take over these duties should an active IAP server fail or need to be shut down.
Active: One or more IAP servers that are running, accepting user traffic, and working tasks. If these servers shut down, automations and user traffic will be impacted.
Standby: One or more IAP servers that are running, but do not receive incoming user traffic from load balancers. They also do not actively work automation tasks. The taskWorkerProps.activate:false parameter is set for these IAP servers to disable working tasks, and load balancers should only send traffic if the active IAP servers go offline.
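The routing rule described above (standby servers receive traffic only when no active server is available) can be sketched as follows. This is a minimal illustration, not the behavior of any particular load balancer: the IapServer class, the role and healthy fields, and the server names are all hypothetical. It models traffic routing only and does not cover how task working is re-enabled on a standby server.

```python
from dataclasses import dataclass


@dataclass
class IapServer:
    # Hypothetical model of one IAP server as seen by a load balancer.
    name: str
    role: str      # "active" or "standby"
    healthy: bool  # result of the load balancer's health check


def eligible_backends(servers):
    """Return the servers that should receive user traffic."""
    active = [s for s in servers if s.role == "active" and s.healthy]
    if active:
        return active
    # No healthy active server remains: fail over to healthy standby servers.
    return [s for s in servers if s.role == "standby" and s.healthy]


servers = [
    IapServer("iap-1", "active", True),
    IapServer("iap-2", "active", True),
    IapServer("iap-3", "standby", True),
]
print([s.name for s in eligible_backends(servers)])
```

With both active servers healthy, only iap-1 and iap-2 are returned. If both are marked unhealthy, the standby iap-3 becomes the only eligible backend.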
RabbitMQ
RabbitMQ is an industry-standard message broker. IAP uses RabbitMQ for communication between IAP and its applications, and to allow IAP to scale horizontally with additional IAP nodes. Within a single data center, high availability is achieved through RabbitMQ clustering and message queue mirroring. To incorporate a standby disaster recovery data center, the WAN-friendly RabbitMQ Federation Plugin can be implemented.
Shared-Token Redis
The IAP HA architecture uses a group of Master/Slave Redis servers to provide a single location where IAP servers share the IAP tokens created on user login. Because Redis is shared, all IAP servers see the same login tokens, and users can switch seamlessly between clustered IAP servers without being redirected to the login page. The Master/Slave Redis setup provides high availability by using Redis Sentinel to monitor the Redis group and ensure a Master is always available.
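The effect of a shared token store can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not how IAP or Redis is implemented: SharedTokenStore stands in for the shared Redis group, and the class names, method names, and response strings are all hypothetical.

```python
import secrets


class SharedTokenStore:
    """Hypothetical stand-in for the shared Redis group (a plain dict here)."""

    def __init__(self):
        self._tokens = {}

    def save(self, token, user):
        self._tokens[token] = user

    def lookup(self, token):
        return self._tokens.get(token)


class IapServer:
    """Hypothetical IAP server that validates tokens against a token store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, user):
        token = secrets.token_hex(16)
        self.store.save(token, user)
        return token

    def handle_request(self, token):
        user = self.store.lookup(token)
        if user is None:
            return "redirect to login"
        return f"{self.name} served {user}"


shared = SharedTokenStore()
iap_a = IapServer("iap-a", shared)
iap_b = IapServer("iap-b", shared)

token = iap_a.login("alice")          # user logs in on one server
print(iap_b.handle_request(token))    # a different server accepts the token
```

A server that uses its own separate store would not recognize the token and would send the user back to the login page, which is the situation the shared Redis avoids.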
MongoDB
MongoDB is a NoSQL document database designed with scalability and flexibility in mind. Itential Automation Platform uses MongoDB as its main repository for data used in its workflow automations and by many of its applications. In the HA architecture, MongoDB also provides a single repository of information for multiple IAP servers working together. MongoDB has extensive scalability and high availability options. The IAP application uses MongoDB's replica set functionality to store redundant copies of data in order to ensure availability in the event of a failure.
MongoDB uses an election-style system to choose a primary MongoDB server. This ensures that only one MongoDB server is ever seen as the Master, preventing split-brain issues. The Primary/Master is the only member that accepts writes, and by default reads are directed to it as well. Should the primary fail, the remaining members will detect the missed heartbeats and hold another election to bring the replica set back online. An election requires a majority of the voting members, so the number of replica set members determines the amount of redundancy available.
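The relationship between member count and redundancy can be worked out with simple arithmetic: a primary can be elected only while a majority of voting members is reachable. The sketch below applies that rule; the function names are illustrative and not part of any MongoDB API.

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect a primary: more than half of the voting members."""
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int) -> int:
    """How many voting members can fail while a primary can still be elected."""
    return voting_members - majority(voting_members)


for n in (1, 2, 3, 5):
    print(f"{n} members: majority={majority(n)}, tolerated failures={fault_tolerance(n)}")
```

Note that a two-member set tolerates no failures, which is why an odd number of voting members (for example, two data-bearing members plus an arbiter) is the usual minimum for redundancy.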
Architecture Sizing
Small: No High Availability
HA-0: Bare minimum node count for each component. Shared-Token Redis is not required as there is only a single IAP node, although a Redis instance is required to be installed locally on the IAP server for token storage.
Note: This architecture is not recommended for Production usage as it provides no high availability. It could be used, however, for non-critical environments such as DEV/TEST, but frequent manual backups are strongly recommended.
Medium: High Availability
HA-3: Single data center containing clusters of all components, with the exception of Supporting Tools and Automation Gateway, as they do not support clustering. Prospector and Policy Engine (Supporting Tools) run as individual instances. Automation Gateway supports 5000 devices per node, and additional nodes can be added as required.
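As a rough sizing aid, the Automation Gateway node count follows from the 5000-devices-per-node figure above. The helper below is a minimal sketch: the function name is hypothetical, and the floor of one node for an empty inventory is an assumption rather than a stated requirement.

```python
import math

DEVICES_PER_NODE = 5000  # per-node capacity stated in this guide


def gateway_nodes_needed(device_count: int) -> int:
    """Estimate how many Automation Gateway nodes a device count requires."""
    if device_count < 0:
        raise ValueError("device_count must not be negative")
    # Assumption: at least one node is deployed even with no devices yet.
    return max(1, math.ceil(device_count / DEVICES_PER_NODE))


print(gateway_nodes_needed(12000))
```

For example, 12000 devices require three nodes, while 5000 devices still fit on a single node.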
Large: High Availability & Disaster Recovery
HA-6: Highly available active data center that virtually matches the layout of the previous "Medium: High Availability" architecture size. In addition, a matching standby data center is available to support disaster recovery scenarios. It is recommended to deploy the MongoDB Arbiter node in a third data center. The RabbitMQ Federation Plugin can be used to transfer data between the Active and Standby RabbitMQ clusters securely across a WAN.